library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(base)
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
library(sf)
## Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(tmap)
library(maptools)
## Loading required package: sp
## Checking rgeos availability: TRUE
library(gganimate)
filename <- "covid_19_clean_complete.csv"
PopulationChina <- "AnnualbyProvince.csv"
df_covidTemp <- read_csv(filename)
## 
## ── Column specification ───────────────────────────────────────────────────────────────────
## cols(
##   `Province/State` = col_character(),
##   `Country/Region` = col_character(),
##   Lat = col_double(),
##   Long = col_double(),
##   Date = col_date(format = ""),
##   Confirmed = col_double(),
##   Deaths = col_double(),
##   Recovered = col_double(),
##   Active = col_double(),
##   `WHO Region` = col_character()
## )
df_popChina <- read_csv(PopulationChina) %>% select("Province/State", "Pop (k)")
## 
## ── Column specification ───────────────────────────────────────────────────────────────────
## cols(
##   `Province/State` = col_character(),
##   `Pop (k)` = col_double(),
##   `2018` = col_double(),
##   `2017` = col_double(),
##   `2016` = col_double(),
##   `2015` = col_double(),
##   `2014` = col_double(),
##   `2013` = col_double(),
##   `2012` = col_double(),
##   `2011` = col_double(),
##   `2010` = col_double()
## )
mydat = readShapePoly("bou2_4p.shp")
## Warning: readShapePoly is deprecated; use rgdal::readOGR or sf::st_read
df_covidTemp #%>% knitr::kable()
df_popChina
df_covidChina <- df_covidTemp %>%
  filter(`Country/Region` == "China") %>% 
  inner_join(df_popChina, by = "Province/State") 
df_covidChina
df_covidChina %>%
  filter(Date == max(Date)) %>%
  # `Pop (k)` is assumed to be in thousands, so cases per 100k = Confirmed / `Pop (k)` * 100
  mutate(Confirmedper100k = Confirmed / `Pop (k)` * 100,
         date_num = as.integer(Date)) %>%
  ggplot() +
  # Province outlines from the shapefile
  geom_polygon(
    data = fortify(mydat),
    aes(x = long, y = lat, group = id),
    colour = "grey",
    fill = NA) +
  theme_grey() +
  coord_map() +
  # One point per province, excluding the outlier Hubei
  geom_point(
    data = . %>% filter(`Province/State` != "Hubei"),
    aes(x = Long, y = Lat, size = Confirmed, alpha = Confirmedper100k),
    color = "red") +
  ggtitle("Cumulative Number of Confirmed COVID-19 Cases in China")
## Regions defined for each Polygons

PlotChina <-
  df_covidChina %>%
  filter(Date <= as.Date("2020-04-20")) %>%
  # `Pop (k)` is assumed to be in thousands, so cases per 100k = Confirmed / `Pop (k)` * 100
  mutate(Confirmedper100k = Confirmed / `Pop (k)` * 100,
         date_num = as.integer(Date)) %>%
  ggplot() +
  # Province outlines from the shapefile
  geom_polygon(
    data = fortify(mydat),
    aes(x = long, y = lat, group = id),
    colour = "grey",
    fill = NA) +
  theme_grey() +
  coord_map() +
  # One point per province and date, excluding the outlier Hubei
  geom_point(
    data = . %>% filter(`Province/State` != "Hubei"),
    aes(x = Long, y = Lat, size = Confirmed, alpha = Confirmedper100k),
    color = "red")
## Regions defined for each Polygons
PlotChina + transition_time(date_num) +
  labs(title = "Day {frame_time - 18282} (day 1 = 2020-01-22)")
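
The offset 18282 in the title above comes from how R stores dates: as the number of days since 1970-01-01. The following is a quick check rather than part of the original analysis, assuming the data begin on 2020-01-22 as stated in the observations:

```r
# Dates are stored internally as days since 1970-01-01.
first_day <- as.Date("2020-01-22")
first_num <- as.integer(first_day)

first_num          # 18283
first_num - 18282  # 1, so the frame title counts days starting from 1

# Converting a day number back to a calendar date
as.Date(18283, origin = "1970-01-01")
```

Subtracting 18282 therefore turns `date_num` into a day counter in which 2020-01-22 is day 1.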

Observations: This animation shows the growth in COVID-19 cases from January 22nd to April 8th, a period that covers the start of transmission, the peak of the outbreak, and the end of community transmission of COVID-19 in China. The data are shown for each province in China, excluding the outlier Hubei, where the outbreak first started.

The size of each data point represents the absolute number of confirmed cases in each province, while the opacity of each point represents the number of confirmed cases per 100k population. Looking only at the number of confirmed cases, the coastal provinces in southern China seem to be the most severely affected by COVID-19. However, the population of each province needs to be taken into account: most coastal provinces have a much larger population than the inland provinces. Dividing confirmed cases by total population yields the confirmed cases per 100k people, which gives a more accurate picture of how severe the transmission is. This is shown on the graph by the opacity scale: more opaque (darker red) points indicate more confirmed cases per 100k.
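
As a minimal sketch of the normalization described above, assuming `Pop (k)` holds population in thousands, cases per 100k can be computed as follows. The province names and numbers in `toy` are made up for illustration and are not taken from the dataset.

```r
library(dplyr)

# Hypothetical toy data; `Pop (k)` is assumed to be population in thousands.
toy <- tibble::tibble(
  `Province/State` = c("A", "B"),
  Confirmed        = c(500, 500),
  `Pop (k)`        = c(100000, 25000)
)

toy_rate <- toy %>%
  # people = `Pop (k)` * 1000, so cases per 100k = Confirmed / `Pop (k)` * 100
  mutate(Confirmedper100k = Confirmed / `Pop (k)` * 100)

toy_rate
```

Both toy provinces have the same absolute count, but B's rate is four times higher because its population is a quarter of A's.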

It can be seen that most dark red dots are located in the eastern part of China. The highest infection rate occurs in Heilongjiang, the northernmost province of China.

Multiple factors may contribute to this pattern, including temperature, humidity, domestic travel, transmission from overseas, and contact with Hubei, where the outbreak started.

A close examination of the animation shows that the provinces around Wuhan are more severely affected at the start of the outbreak, while the situation later worsens in northern provinces such as Heilongjiang.

Thus, I conclude that COVID-19 appears to have a higher transmission rate along the coast and in colder environments.

df_covidChina %>%
  filter(`Province/State` != "Hubei") %>%
  ggplot() +
  geom_line(aes(x = Date, y = Confirmed, color = `Province/State`)) +
  labs(title = "COVID-19 outbreaks start at different times in each province of China")

Observations: This figure clearly shows that different provinces have outbreaks at different times. While most of the provinces near Hubei are affected in January and February, northern provinces like Heilongjiang and Beijing are affected later, which is a sign of more severe community transmission.
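
One way to quantify the timing difference described above is to find, for each province, the first date on which cumulative cases reach a threshold. The sketch below uses made-up numbers and a hypothetical threshold of 100 cases; neither comes from the original analysis.

```r
library(dplyr)

# Hypothetical cumulative counts for two provinces (not real data).
toy <- tibble::tibble(
  `Province/State` = rep(c("Near", "Far"), each = 4),
  Date = rep(as.Date("2020-01-22") + c(0, 10, 20, 30), times = 2),
  Confirmed = c(20, 150, 300, 320,   # "Near" rises early
                0, 10, 90, 250)      # "Far" rises later
)

threshold <- 100  # assumed cutoff for "the outbreak has started"

onset <- toy %>%
  filter(Confirmed >= threshold) %>%
  group_by(`Province/State`) %>%
  # Earliest date at or above the threshold in each province
  summarise(onset_date = min(Date), .groups = "drop") %>%
  arrange(onset_date)

onset
```

With the real data, the same pipeline could start from `df_covidChina` instead of `toy`, and the resulting onset dates could be compared between provinces near Hubei and those in the north.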